One database, many dictionaries — varying co(n)text with the dictionary application TshwaneLex
Authors: Gilles-Maurice de Schryver and David Joffe
Abstract
This paper provides background information for a software demonstration of TshwaneLex, during which the actual use of the application is illustrated in real time. The demonstration focuses on two main aspects, each paired with a related aspect, that are of particular interest to the ASIALEX 2005 conference: full Unicode support and customisable sorting on the one hand, and advanced DTD (Document Type Definition) aspects and Linked View mode on the other. Together, they form the backbone of the claim that a single TshwaneLex database successfully provides for multiple dictionaries.

The dictionary compilation software TshwaneLex

TshwaneDJe HLT has been producing the dictionary compilation software TshwaneLex since 2002. Most of the eleven South African National Lexicography Units currently use TshwaneLex, both for compiling their (monolingual) dictionaries and for presenting the results on the Web. The software is now also used at a number of widely respected dictionary publishing houses such as Oxford University Press, Macmillan and Van Dale Lexicografie. Government-sponsored research centres, such as the Royal National Academy of Medicine in Spain, have also begun to build their latest reference databases around TshwaneLex. Copies of TshwaneLex have furthermore been acquired by a variety of dictionary teams worldwide, who are compiling dictionaries for, amongst others, Lingála, Cilubà and Kiswahili (all spoken in Africa), Welsh, Irish and Estonian (all lesser-known European languages), Bai and Chinese (both spoken in China), Motu (an Austronesian language used in Papua New Guinea), and Inezeño Chumash (a Native American language from the US). Each of those languages needs its own script, and each of those projects needs its own dictionary grammar; TshwaneLex provides for both.
A general introduction to TshwaneLex, with a focus on a selection of lexicographic underpinnings, may be found in Joffe & De Schryver (2004), while an example of an online application that revolves around TshwaneLex has been described in De Schryver & Joffe (2004). As pointed out in those publications, TshwaneLex contains numerous unique and highly developed lexicographic features. For example:

• An advanced cross-reference system not only shows related (incoming and outgoing) cross-references of the current lemma, but also automatically updates target homonym and sense numbers when these change.
• A filter function allows the user to work with a subset of the lemmas in the dictionary based on specified criteria; a dictionary text search function further enables complex search queries on that filtered subset using Unicode regular expressions.
• A compare/merge feature visually displays differences between database versions, and allows changes to be selectively merged into the main database.

ONE DATABASE, MANY DICTIONARIES — VARYING CO(N)TEXT WITH THE DICTIONARY APPLICATION TSHWANELEX by GILLES-MAURICE DE SCHRYVER & DAVID JOFFE

In addition to paper, dictionaries can be published on the Web with the online dictionary module, which features a sophisticated query logging system. The localisable user interface allows users to browse the dictionary in their own language, and their preferred language may further be used to dynamically customise the metalanguage within returned articles. This feature also extends to the electronic dictionary module.

Full Unicode support

Unicode, the international character set standard, is supported throughout TshwaneLex, and on all levels in the dictionary database. This makes it possible not only to enter data in virtually any language, but also to use Asian and Latin characters side by side in any attribute field in the database.
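The filter-then-search idea from the feature list above, applied to fields that mix Latin and Asian characters, can be sketched in a few lines of Python. This is only an illustration of the concept and not TshwaneLex code: the `Entry` class, the sample entries and the function names are all hypothetical, and the CJK range used in the pattern covers only the basic CJK Unified Ideographs block.

```python
import re
from dataclasses import dataclass


@dataclass
class Entry:
    """Hypothetical dictionary entry; any field may mix scripts."""
    lemma: str
    pos: str          # part of speech
    translation: str  # Chinese characters plus a romanised form


# Invented sample data, for illustration only.
ENTRIES = [
    Entry("book", "noun", "书 (shu)"),
    Entry("read", "verb", "读 (du)"),
    Entry("tea", "noun", "茶 (cha)"),
    Entry("coffee shop", "noun", "咖啡馆 (kafeiguan)"),
]


def filter_entries(entries, predicate):
    """Step 1: narrow the dictionary down to a subset of entries."""
    return [e for e in entries if predicate(e)]


def search(entries, pattern):
    """Step 2: run a regular-expression search over the given subset.

    Python's `re` module matches on Unicode code points for `str` patterns,
    so Latin and CJK text can be queried with the same expression syntax.
    """
    rx = re.compile(pattern)
    return [e for e in entries if rx.search(e.lemma) or rx.search(e.translation)]


nouns = filter_entries(ENTRIES, lambda e: e.pos == "noun")
# Nouns whose translation is a single CJK ideograph followed by a space.
single_char_nouns = search(nouns, r"^[\u4e00-\u9fff] ")
print([e.lemma for e in single_char_nouns])
```

The search runs only on the filtered subset, so the verb "read" is never examined even though its translation would also match the pattern.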
For languages such as Chinese, Japanese or Korean, or say Arabic or Hebrew, data can be entered directly into TshwaneLex using any of the Input Method Editors (IMEs) available in Windows 2000 or XP. See Figure 1, which shows a screenshot of an elementary bilingual English-Chinese dictionary.

Completely customisable sorting

The default sorting method supported by TshwaneLex is a configurable four-pass table-based sorting system based on the ISO 14651 standard. The four passes handle characteristics that may take precedence over one another, viz. the so-called ‘base alphabet’, diacritics, uppercase/lowercase differences, and so-called ‘ignorable’ characters (typically non-alphabetic characters such as spaces and punctuation marks). This is shown in Figure 2, where the sorting tables have been configured for the Estonian alphabet.

TshwaneLex automatically takes care of the sorting of lemmas, thus freeing the lexicographer from having to do so. However, many different sorting methods exist, often several for the same language, so the question arises of how to support any sorting method that may be desired. To solve this, TshwaneLex includes an extensibility mechanism whereby users can create plug-ins that add support for new sorting methods. As a result, any sorting system (e.g. by radical/stroke count or by pinyin romanised form for Chinese) may be used.

ASIALEX 2005 WORDS IN ASIAN CULTURAL CONTEXTS

Figure 1. Unicode support for a bilingual English-Chinese dictionary in TshwaneLex

Generating multiple dictionaries from a single database

Elsewhere in this volume (cf. Joffe & De Schryver 2005) the main aspects of the customisable and multilayered DTD editor dialog are presented.
Not only can the dictionary grammar for any project be flexibly configured and then kept under control with the built-in DTD; because all elements and attributes are also linked to a comprehensive style system for generating the output (and preview), a single database can efficiently hold several dictionaries. Broadly speaking, this is achieved in two ways: firstly, by making use of multiple element ‘categories’, to which the lexicographer assigns the various data attributes depending on which dictionary or dictionaries they should appear in; and secondly, by defining a different set of styles for each ‘view’ of the database, i.e. for each dictionary. Certain element categories are made visible or invisible in each style, which thus effectively functions as a kind of “mask” that filters the data and reveals only the portions to be shown for the current dictionary. This also allows a different ‘look’ to be defined for each dictionary. These features are illustrated in Figures 3 and 4, which respectively show the desktop and pocket editions of a bidirectional French-Dutch dictionary (© 2005 Van Dale Lexicografie). One hotkey allows the lexicographer to switch between the two views, and thus also between the two dictionaries. The extent of co-text and context for the production of any particular dictionary may thus easily be decided on at the output stage.

Figure 2. Configuring table-based sorting for Estonian in TshwaneLex

This feature may also be tied in with the customisation of the metalanguage, as described earlier, so that aspects of the dictionary output can be customised further according to the language of the target user of the dictionary.
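The category-and-style mechanism described above can be pictured with a minimal Python sketch. It is a simplified model under stated assumptions, not the TshwaneLex data model: the class names, the category names `all_editions` and `desktop_only`, the bracket markup, and the assignment of ‘statiegeld’ to the desktop edition only are all invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class DataField:
    """One piece of article data, tagged with a hypothetical element category."""
    name: str
    text: str
    category: str


@dataclass
class View:
    """One 'view' of the database, i.e. one dictionary.

    visible_categories acts as the mask; markup is a stand-in for the style
    set and maps a field name to a (prefix, suffix) pair.
    """
    name: str
    visible_categories: frozenset
    markup: dict


def render(article, view):
    """Output only the fields whose category is visible in this view."""
    parts = []
    for f in article:
        if f.category not in view.visible_categories:
            continue  # masked out for this dictionary
        prefix, suffix = view.markup.get(f.name, ("", ""))
        parts.append(prefix + f.text + suffix)
    return " ".join(parts)


# One stored article; the category assignments are illustrative only.
ARTICLE = [
    DataField("lemma", "consigne", "all_editions"),
    DataField("translation", "bagagedepot", "all_editions"),
    DataField("translation", "statiegeld", "desktop_only"),
]

DESKTOP = View("desktop", frozenset({"all_editions", "desktop_only"}),
               {"lemma": ("[", "]")})
POCKET = View("pocket", frozenset({"all_editions"}),
              {"lemma": ("", ":")})

desktop_text = render(ARTICLE, DESKTOP)  # "[consigne] bagagedepot statiegeld"
pocket_text = render(ARTICLE, POCKET)    # "consigne: bagagedepot"
print(desktop_text)
print(pocket_text)
```

The article is stored once; only the view changes. A view aimed at a particular group of users could be modelled in the same way, with its own set of visible categories.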
For example, in a bidirectional Japanese-English dictionary, the information in some fields may be useful primarily to either a Japanese or an English mother-tongue speaker. Lexicographers sometimes have to make editorial decisions and compromises based on assumptions about the language of the target market; by customising the output from a single database this need not be the case. In an electronic dictionary, one could take still other factors into account, such as the level of the user, presenting different views of the dictionary to beginner or advanced language learners.

Linked View mode for bilingual dictionary editing

Several innovative functions assist in bilingual dictionary compilation, such as side-by-side editing, automated reversal and Linked View mode. In side-by-side editing mode, the screen is split in two down the middle, and the lexicographer can work on either side of a bilingual dictionary by simply moving between the windows.

Figure 3. Desktop-edition view of a bidirectional French-Dutch dictionary in TshwaneLex, from the same database as the pocket edition

In Linked View mode, as in Figures 3 and 4, related articles in the reverse side of a bilingual dictionary are automatically displayed. For instance, the left-hand side of Figure 3 shows that the Dutch words ‘bagagedepot’, ‘statiegeld’, ‘lege fles’, ‘instructie’, ‘kwartierarrest’ and ‘(het) nablijven’ have been used as translation equivalents for the French word ‘consigne’. TshwaneLex then automatically shows all and only those articles that have these translation equivalents as lemma signs, in this case ‘bagagedepot’, ‘statiegeld’, ‘instructie’ and ‘nablijven’, as may be seen from the right-hand side of Figure 3.
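The lookup behind the ‘consigne’ example can be sketched as follows. This is a toy model and not the TshwaneLex implementation: the two data structures, the extra lemma ‘fiets’ and the rule for stripping the parenthesised article ‘(het)’ are assumptions made for the sake of the illustration.

```python
import re

# French-Dutch side: lemma -> translation equivalents (from the 'consigne' example).
FR_NL = {
    "consigne": ["bagagedepot", "statiegeld", "lege fles",
                 "instructie", "kwartierarrest", "(het) nablijven"],
}

# Dutch-French side: lemma signs that have an article of their own.
# 'fiets' is an unrelated, invented lemma that the linked view must not show.
NL_LEMMAS = {"bagagedepot", "statiegeld", "instructie", "nablijven", "fiets"}


def normalise(equivalent):
    """Drop parenthesised material such as '(het)' before matching.

    This matching rule is an assumption of the sketch.
    """
    return re.sub(r"\([^)]*\)", "", equivalent).strip()


def linked_view(lemma):
    """Reverse-side articles whose lemma sign is an equivalent of `lemma`."""
    return [n for n in map(normalise, FR_NL.get(lemma, [])) if n in NL_LEMMAS]


def without_article(lemma):
    """Equivalents of `lemma` that have no article on the reverse side."""
    return [n for n in map(normalise, FR_NL.get(lemma, [])) if n not in NL_LEMMAS]


print(linked_view("consigne"))
print(without_article("consigne"))
```

The first list reproduces the four articles named in the example; the second lists the equivalents (‘lege fles’ and ‘kwartierarrest’) that do not appear as lemma signs on the reverse side, which is the kind of gap a lexicographer may want to review.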
The Linked View mode feature thus allows the lexicographer to attempt to honour the reversibility principle, that is, the condition whereby all lexical items presented as lemma signs or translation equivalents in the X-Y section of a dictionary are respectively translation equivalents and lemma signs in the Y-X section of the dictionary (cf. e.g. Tomaszczyk 1988: 290; Gouws 1989: 162; Gouws 1996: 80). The reversibility principle has always been a crucial but hitherto little-investigated requirement in lexicography, one that is now easily accessible in TshwaneLex.

Figure 4. Pocket-edition view of a bidirectional French-Dutch dictionary in TshwaneLex, from the same database as the desktop edition